Forschungspraktikum 1+2: Computational Social Science

Session 08: Advanced Topic Modeling

Dr. Christian Czymara

Agenda

  • Advancements of topic models
    • Biterm Topic Models
    • Keyword Assisted Topic Models
    • BERTopic Models
  • Choosing the right method
  • Tutorial: Identifying keyword topics in news articles

Recap: Topic Models

  • Inductive approach to identify word clusters in texts (topics)
  • Suited for explorative research questions
  • But
    • What if texts are short?
    • Do you really have no prior knowledge of the subject before running your model?

Topic Modelling Tools in R

Wiedemann (2022): 288

Biterm Topic Models (BTM)

Biterm Topic Models

  • Short texts (e.g., tweets) have limited word co-occurrences, making traditional topic models like LDA less effective
  • Biterm Topic Models (BTM) model word co-occurrence patterns (biterms) across the corpus, not within individual documents
  • BTM package for R by Wijffels (2023)
  • Automatically creates a “garbage” topic that absorbs common words (reduces the need for preprocessing)

What Are Biterms?

  • A biterm is an unordered pair of words \((w_i, w_j)\) from a document
  • Example: “data science tools”
    • \((data, science)\)
    • \((data, tools)\)
    • \((science, tools)\)
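
As a conceptual illustration (in Python, not the R BTM package), the biterms of a short document are simply all unordered pairs of its tokens:

```python
from itertools import combinations

def biterms(tokens):
    """All unordered word pairs (biterms) in a short document.
    BTM models the co-occurrence of these pairs across the whole
    corpus rather than word counts within single documents."""
    return list(combinations(tokens, 2))

biterms(["data", "science", "tools"])
# [('data', 'science'), ('data', 'tools'), ('science', 'tools')]
```

A document of length n yields n(n-1)/2 biterms, which is why even very short texts still contribute usable co-occurrence information.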

Example Data: Trump (and others) on Twitter

  • Combination of two data files: a sample from the Trump Twitter Archive and a sample from the Sentiment140 dataset, 519 observations each
  • Mean number of characters in Tweets: 110 (standard deviation: 56)

Preprocessing the Data

library(quanteda)
library(tidyr)

# convert the dfm to a long data frame of document-token pairs
toks_2 <- convert(dfm_tweets, to = "data.frame")
toks_2 <- toks_2 %>%
  pivot_longer(cols = !doc_id, names_to = "tokens")

# keep only tokens that actually occur in a document
toks_2 <- toks_2[toks_2$value > 0, ]
toks_2 <- toks_2[, c("doc_id", "tokens")]

head(toks_2)
# A tibble: 6 × 2
  doc_id tokens 
  <chr>  <chr>  
1 text1  bleh   
2 text1  feeling
3 text2  gonna  
4 text2  bangerz
5 text2  close  
6 text2  feel   

Run the BTM

  • k: Number of topics (20)
  • background: Background topic (filters common words)
library(BTM)
k <- 20
set.seed(1234)
bi_topics  <- BTM(toks_2,
                  k = k,
                  background = TRUE
                  )

terms(bi_topics, top_n = 10)
[[1]]
              token probability
1  @realdonaldtrump  0.04075617
2         #ausvotes  0.01026527
3             #swag  0.01026527
4            #vomit  0.01026527
5       @danscavino  0.01026527
6       @dcexaminer  0.01026527
7  @gemmaannestyles  0.01026527
8    @jessicaaaboyd  0.01026527
9       @just_brash  0.01026527
10       @kathviner  0.01026527

[[2]]
            token probability
1               â  0.03343312
2             bad  0.03142226
3          flight  0.03041683
4          smells  0.02941140
5       attendant  0.02890868
6        domestic  0.02840596
7  vomit-inducing  0.02614374
8         saviour  0.02589238
9          drunks  0.02538967
10        horrors  0.02538967

[[3]]
              token probability
1             trump 0.014215289
2  @realdonaldtrump 0.012821906
3         beautiful 0.011985877
4              time 0.010592494
5            senate 0.009756465
6           amazing 0.009477788
7           hillary 0.009477788
8           crooked 0.008920435
9             house 0.008920435
10             week 0.008641759

[[4]]
              token probability
1  @realdonaldtrump 0.024728358
2               win 0.014946786
3           america 0.013044813
4   congratulations 0.013044813
5         president 0.010871131
6             crime 0.010599420
7             obama 0.010327710
8             trump 0.010327710
9           #auspol 0.009784289
10         inducing 0.009512579

[[5]]
            token probability
1           white  0.04398685
2             red  0.04096509
3               â  0.03995783
4            scum  0.03760757
5  #represent1908  0.03727181
6            blue  0.03727181
7            earn  0.03727181
8            evil  0.03727181
9           gents  0.03727181
10          gimps  0.03727181

[[6]]
         token probability
1            â  0.03468296
2       speech  0.01753825
3      victory  0.01597964
4      #abbott  0.01520034
5    #turnbull  0.01520034
6  humiliating  0.01520034
7         loss  0.01520034
8       worthy  0.01520034
9      @deedre  0.01481069
10     fingers  0.01481069

[[7]]
              token probability
1               joe 0.017611825
2             biden 0.013528516
3  @realdonaldtrump 0.012507688
4         democrats 0.010976447
5          democrat 0.010466033
6             media 0.010210826
7              care 0.009955620
8            people 0.009700413
9            strong 0.009700413
10            tough 0.009445206

[[8]]
              token probability
1  @realdonaldtrump 0.022635090
2              time 0.019837809
3         president 0.012717457
4             party 0.011700264
5           tonight 0.011445965
6               lot 0.010174474
7          congress 0.008140088
8         interview 0.008140088
9            public 0.007885790
10             town 0.007885790

[[9]]
     token probability
1   abbott 0.016064185
2   #vomit 0.015809239
3      job 0.011985040
4  #auspol 0.011220200
5     hope 0.010965253
6      bit 0.009435574
7  country 0.009180627
8    don’t 0.009180627
9  michael 0.008925681
10   mouth 0.008925681

[[10]]
              token probability
1            #vomit  0.01568888
2  @realdonaldtrump  0.01568888
3             don’t  0.01135032
4              wait  0.01068286
5         president  0.01034912
6     #sydneyfringe  0.01001539
7          complete  0.01001539
8             naked  0.01001539
9           unicorn  0.01001539
10          victory  0.01001539

[[11]]
       token probability
1      trump 0.011294634
2  announced 0.010730044
3          â 0.010165453
4      talks 0.009318567
5      teach 0.009318567
6   @foxnews 0.008189386
7      juice 0.008189386
8     @boraz 0.007624796
9      alive 0.007624796
10     farts 0.007624796

[[12]]
              token probability
1  @realdonaldtrump 0.032528152
2             trump 0.019675322
3              stay 0.018888414
4       @whitehouse 0.011019334
5         president 0.007871703
6         democrats 0.007084795
7           federal 0.006560189
8             media 0.006560189
9           control 0.006035584
10           donald 0.006035584

[[13]]
         token probability
1         feel  0.03184147
2         sick  0.02763137
3       abbott  0.02631572
4       poison  0.02368441
5   government  0.02157936
6        thatâ  0.01815866
7   incredibly  0.01657988
8  forthcoming  0.01631675
9    @shellity  0.01421170
10         t.â  0.01421170

[[14]]
            token probability
1           drink  0.02991690
2          charge  0.02819773
3    @tarameakins  0.02751007
4       character  0.02751007
5         credlin  0.02751007
6       dismissed  0.02751007
7         driving  0.02751007
8            peta  0.02751007
9       reference  0.02751007
10 @brigadierslog  0.02407174

[[15]]
              token probability
1         democrats 0.025073082
2            people 0.022395634
3  @realdonaldtrump 0.018257760
4         president 0.014363291
5             trump 0.011685843
6       republicans 0.011199034
7            pelosi 0.009982012
8              wait 0.009982012
9             angry 0.009738608
10        including 0.009738608

[[16]]
              token probability
1  @realdonaldtrump 0.011853100
2           workers 0.011212565
3         democrats 0.010892297
4           country 0.010572030
5      @barackobama 0.010251762
6               tax 0.009931495
7             trump 0.008970692
8             women 0.008970692
9      announcement 0.008650425
10          justice 0.008650425

[[17]]
       token probability
1  president 0.015897052
2      china 0.012140250
3      trump 0.011273296
4        fed 0.010984311
5      gonna 0.010406341
6       hard 0.010406341
7   business 0.009539387
8   monetary 0.009539387
9    china’s 0.009250402
10      deal 0.009250402

[[18]]
       token probability
1   carolina 0.011207828
2       time 0.011207828
3      north 0.010731000
4      house 0.009777345
5      crazy 0.009538932
6     border 0.008346863
7      hours 0.008108450
8  president 0.007870036
9      phony 0.007393209
10  security 0.007393209

[[19]]
        token probability
1      people  0.05586418
2   #ausvotes  0.03874085
3       boats  0.03846013
4    cheering  0.03846013
5    rhetoric  0.03846013
6    stopping  0.03846013
7  @senthorun  0.03705658
8      pelosi  0.01375762
9       nancy  0.01263478
10    country  0.01066980

[[20]]
            token probability
1          people 0.013294506
2            news 0.012086189
3            fake 0.010273714
4  administration 0.008763318
5       americans 0.008763318
6       yesterday 0.008461239
7             guy 0.007857080
8        carolina 0.007555001
9            gain 0.007252922
10          north 0.007252922

Plot the BTM Results

  • Plot the first five topics, excluding the garbage one
  • … with 15 terms each
  • … labelled by the topic proportion

Plot the BTM Results

library(textplot)
plot(bi_topics, which = 2:6, subtitle = "First 5 topics",
    labels = paste(round(bi_topics$theta*100, 2), "%", sep = ""), top_n = 15)

Keyword Assisted Topic Models (keyATM)

Keyword Assisted Topic Models

  • Combines supervised and unsupervised approaches
  • Allows adding keywords to label topics prior to fitting the model (adding domain knowledge)
  • Semisupervised model that combines a small amount of labeled information (the keywords) with a large amount of unlabeled data
  • keyATM package for R by Eshima et al. (2023)

Example: Immigration in Swedish Newspapers 1945–2019

Define the Keywords

  • Simple example: Just one keyword topic (Trump)
keywords <- list(
  trump  = c("donald", "trump", "president", "america")
  )

Prepare the Data

library(keyATM)

keyATM_docs <- keyATM_read(texts = dfm_tweets)

Run the Base keyATM Model

key_topics <- keyATM(
  docs              = keyATM_docs,  # text input
  no_keyword_topics = k-1,          # number of topics without keyword topics
  keywords          = keywords,     # keywords
  model             = "base",       # select the model
  options           = list(seed = 1337)
  )

Results

top_words(key_topics)
            1_trump          Other_1     Other_2  Other_3
1     president [✓] @realdonaldtrump           â   #vomit
2  @realdonaldtrump           coming     victory    video
3         trump [✓]      @whitehouse      speech   abbott
4       america [✓]              sbw        loss @youtube
5            people             live        mind      day
6               job              ads     @deedre     wait
7          election          channel   #turnbull        â
8           country            local humiliating   george
9             obama           season      worthy    candy
10       donald [✓]           highly     #abbott    prank
                  Other_4                Other_5   Other_6     Other_7 Other_8
1                  smells                      â democrats        sick  turkey
2                     bad                  white    people        feel  tweets
3                  flight                complex     don’t      poison   tears
4                domestic         vomit-inducing     times      abbott   daily
5               attendant                saviour     media  government reading
6                  drunks               @irevolt    abbott       thatâ    hard
7                 horrors            nightmarish       eat  incredibly   ori's
8                    @smh                  glory    sleepy forthcoming      aj
9  http://t.co/q1ahadedem http://t.co/t7qao4ztck     wanna      school     war
10                      í http://t.co/4mbnyz1juw     prime     victory    it’s
     Other_9   Other_10         Other_11 Other_12               Other_13
1     pelosi     people @realdonaldtrump carolina              trump [1]
2      nancy  #ausvotes          fucking    north       @realdonaldtrump
3  interview   cheering              lol     life                   poll
4       time   stopping          ballots     book              beautiful
5    federal      boats           forget    wrong             #trump2016
6    corrupt   rhetoric          episode   hungry                    tax
7        bad @senthorun          massive    south #makeamericagreatagain
8      chuck       news             true   strong                  power
9     answer       fake          amazing tomorrow                  women
10     don’t    country             time  senator                 didn’t
         Other_14         Other_15         Other_16 Other_17 Other_18
1           drink           #vomit             word        í    white
2         driving @realdonaldtrump           people     feel        â
3    @tarameakins              joe @realdonaldtrump    gonna     scum
4            peta             time              god inducing      red
5         credlin            biden             love  tonight     evil
6          charge         remember           #vomit  #auspol    gents
7       dismissed             life            agree     rudd    gimps
8       character       projectile             wait  o'neill  majesty
9       reference         tomorrow             real  presser     earn
10 @brigadierslog            smell           united @unami22   thieve
           Other_19
1  @realdonaldtrump
2             mouth
3            person
4         announced
5              golf
6              yeah
7           michael
8               ugh
9         americans
10             time

Plot the Base keyATM Results

plot_topicprop(key_topics, show_topic = 1:5)

Example Texts

  • Extract the top documents for a given topic to illustrate the results
tweets_combined$text[top_docs(key_topics, 10)[, 1]]
 [1] "RT @PressSec: President @realDonaldTrump spoke earlier with the @FitnessGov (including @IvankaTrump, Bill Belichick, @MarianoRivera, @Hersc…"                                                                                                                                                   
 [2] "Twitter is doing nothing about all of the lies &amp; propaganda being put out by China or the Radical Left Democrat Party. They have targeted Republicans, Conservatives &amp; the President of the United States. Section 230 should be revoked by Congress. Until then, it will be regulated!"
 [3] "RT @dbongino: It’s Wednesday, October 7th 2020, and Barack Obama was DEFINITELY the most corrupt President in US history.\n#Obamagate"                                                                                                                                                          
 [4] "\"\"\"@garthdahdah: @TrumpGolfLA [THE] top luxury public golf course in the country http://t.co/MYiwkyfiVU” @realDonaldTrump     Thank you!\""                                                                                                                                                  
 [5] "This story is no longer about John McCain, it’s about our horribly treated vets. Illegals are treated better than our wonderful veterans."                                                                                                                                                      
 [6] "\"\"\"It's important that we help poor people to become independent, self-sufficient individuals who gain the benefits of work.\"\" #TimeToGetTough\""                                                                                                                                          
 [7] "“Director Brennan’s recent statements purport to know as fact that the Trump campaign colluded with a foreign power. If Director Brennan’s statement is based on intelligence he received while leading the CIA, why didn’t he include it in the Intelligence Community Assessment......"       
 [8] "\"\"\"@canoetravel: Plans for wind farm dropped at @realDonaldTrump's #Scotland golf resort: http://t.co/Klaf69uHet #travel #golf\"\"  Great news!\""                                                                                                                                           
 [9] "“The FBI received documents from Bruce Ohr (of the Justice Department &amp, whose wife Nelly worked for Fusion GPS).” Disgraced and fired FBI Agent Peter Strzok. This is too crazy to be believed! The Rigged Witch Hunt has zero credibility."                                                
[10] "Thankfully there were no trains in sight last night...â\u0080\u009c@kirstg4: Being on the vomit comet always reminds me \nof you @Super_Lenoâ\u0080\u009d"                                                                                                                                      

Number of Topics for keyATM

Eshima et al. (2023): 13

Tutorial 08: Exercises 1.-2.

Run the Covariate keyATM Model

  • Similar to STM, keyATM’s Covariate Model allows topic probabilities to vary over document-level variables
  • Example: Does Trump tweet more about Trump?
  • Information stored in tweets_combined$is_trump (0: Not Trump; 1: Trump)

Run the Covariate keyATM Model

key_topics_cov <- keyATM(
  docs              = keyATM_docs,
  no_keyword_topics = k-1,
  keywords          = keywords,
  model             = "covariates",
  model_settings    = list(covariates_data    = tweets_combined,
                           covariates_formula = ~ is_trump),
  options           = list(seed = 1337)
  )

covariates_info(key_topics_cov)
Colnames: (Intercept), is_trumptrump
Standardization: non-factor
Formula: ~ is_trump

Preview:
  (Intercept) is_trumptrump
1           1             0
2           1             0
3           1             0
4           1             0
5           1             0
6           1             0

Estimate Differences

  • Examine how topic proportions differ between covariate groups
  • Additional step: Estimate the differences based on the keyATM object
  • Example: Compare topic proportions for Trump vs. non-Trump tweets
strata_topic <- by_strata_DocTopic(
  key_topics_cov,
  by_var = "is_trumptrump",
  labels = c("Not Trump", "Trump")
  )

Plot the Covariate keyATM Results

  • Plot topic proportions across covariate groups
plot(strata_topic, var_name = "is_trump", show_topic = 1)

Tutorial 08: Exercises 3.

Bidirectional Encoder Representations from Transformers (BERTopic)

BERTopic Models

  • Uses pre-trained transformer-based language models to create word clusters
  • Ranks words by class-based Term Frequency-Inverse Document Frequency (c-TF-IDF) scores
  • Which words are typical for topic 1 and not so much for all other topics?
  • Single membership models (each text belongs to only one topic)
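
A minimal Python sketch of the c-TF-IDF idea (not the BERTopic implementation itself): treat all documents of a topic as one large class document, then weight each word's within-class frequency by how rare it is across classes, following the scheme tf(word, class) * log(1 + A / f(word)), where A is the average number of words per class and f(word) is the word's total frequency.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """Class-based TF-IDF: how typical is each word for a class (topic)
    relative to all other classes? class_docs maps class -> list of texts."""
    # one big "document" per class
    counts = {c: Counter(" ".join(docs).split())
              for c, docs in class_docs.items()}
    # A: average number of words per class
    A = sum(sum(cnt.values()) for cnt in counts.values()) / len(counts)
    # f(word): word frequency across all classes
    total = Counter()
    for cnt in counts.values():
        total.update(cnt)
    scores = {}
    for c, cnt in counts.items():
        n_c = sum(cnt.values())
        scores[c] = {w: (tf / n_c) * math.log(1 + A / total[w])
                     for w, tf in cnt.items()}
    return scores

# toy corpus (hypothetical): "wins" appears in both classes,
# so it scores lower than class-specific words like "match" or "party"
docs = {
    "sports":   ["goal match team", "team wins match"],
    "politics": ["vote election party", "party wins vote"],
}
s = c_tf_idf(docs)
```

Words shared across topics are down-weighted, which is exactly how BERTopic surfaces the words that are typical for one topic "and not so much for all other topics".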

BERTopic Models

  • Automatically chooses number of topics, including a garbage topic
  • Number of topics can be changed with nr_topics, which merges topics after they have been created
  • Usually requires little preprocessing, as all parts of a document are used for the document embeddings (though it can make sense to remove noise such as HTML tags)
  • At the moment, the BERTopic library by Grootendorst (2023) is only available in Python (not R)
  • Some best practices recommended by the author

BERTopic Models

  • BERTopic supports many topic modeling variants, e.g., guided, semi-supervised, dynamic, and hierarchical topic modeling
  • Check out the documentation and the GitHub repository to find the right settings for your case


Using BERTopic

  • Install the library via pip install bertopic
  • Use pre-trained transformer models (e.g., BERT, RoBERTa)
  • Explore customization options, such as setting nr_topics to merge similar topics
  • Will be applied in session 12 (Using Google and Amazon Web Services)
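
A minimal usage sketch (the full workflow follows in session 12; `docs` is a placeholder for a reasonably large list of raw texts, e.g., tweets):

```python
from bertopic import BERTopic

docs = ["..."]  # placeholder: a list of raw text strings

# nr_topics = "auto" merges similar topics after they have been created
topic_model = BERTopic(nr_topics = "auto")
topics, probs = topic_model.fit_transform(docs)

topic_model.get_topic_info()  # overview: topic sizes and top terms
topic_model.get_topic(0)      # c-TF-IDF terms for topic 0
```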

Summing up

Deciding on the Right Method

Criterion           | Dictionary                                          | Supervised                             | Unsupervised                | BERTopic
Supervision level   | Predefined keywords or dictionaries                 | Annotated training data                | None                        | None
Preprocessing       | High (lemmatization, custom dictionary setup, etc.) | Moderate (text cleaning)               | Moderate (text cleaning)    | Minimal (often raw text)
Flexibility         | Completely defined by user-created dictionaries     | Depends on training set and classifier | Number of topics and priors | Fine-tuning and priors
Computational power | Low to medium                                       | Medium                                 | Medium                      | High

Deciding on the Right Method

  • No universal solution: Each method (dictionary, supervised, unsupervised, BERTopic) has unique strengths
  • Experiment with different approaches and validate results
  • Data-driven choice: Tailor the method to the specific needs of the research question